
Conversation

@linfeng-yuan (Collaborator) commented Sep 21, 2025

What this PR does / why we need it?

This miscellaneous PR contains several small fixes:

  1. Fix initialization and forward bugs of DeepseekMTPLayer when shared_expert_dp is enabled.
  2. Fix a tensor shape mismatch after o_proj caused by a workaround change in NPUModelRunner.
  3. Avoid an unnecessary decline of kv_cache memory (default: 64 MB) when use_cached_kv_cache_bytes is disabled.
  4. Fall back fused_moe_state from MC2 to All2All, since the padding logic of mc2_mask is incompatible with the input hidden_states when shared_expert_dp is enabled (see the sketch after this list).
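
For fix (4), a minimal sketch of the fallback condition is below. The names `FusedMoEState` and `select_fused_moe_state` are illustrative placeholders, not the exact vllm-ascend API; only fused_moe_state, MC2, All2All, and mc2_mask come from the PR description.

```python
from enum import Enum


class FusedMoEState(Enum):  # hypothetical enum, for illustration only
    MC2 = "mc2"
    ALL2ALL = "all2all"


def select_fused_moe_state(preferred: FusedMoEState,
                           enable_shared_expert_dp: bool) -> FusedMoEState:
    # With shared_expert_dp enabled, mc2_mask is padded independently of
    # the input hidden_states, so the MC2 path would see mismatched
    # shapes; fall back to All2All in that case.
    if enable_shared_expert_dp and preferred is FusedMoEState.MC2:
        return FusedMoEState.ALL2ALL
    return preferred
```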

Once this PR is merged, users can launch disaggregated_prefill deployments (large_ep) with deepseek_mtp and shared_expert_dp, as on the v0.9.1-dev branch. The remaining problem of the decline in kv_cache tokens compared to v0.9.1-dev will be resolved by #3073.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

E2E vLLM serving of deepseek_mtp with torchair graph mode, and with enable_shared_expert_dp in eager mode. Large EP deployments were also tested with this PR.


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a bug in DeepSeek's Multi-Token Prediction (MTP) when enable_shared_expert_dp is active. The core of the fix involves conditionally using TorchairDeepSeekMTP for this configuration. Additionally, the logic for handling attention metadata has been updated to be more robust by avoiding a hardcoded dictionary key. My review points out a potential crash in this new metadata handling and provides a suggestion for a safer implementation.
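
As an illustration of the safer pattern the review suggests, a sketch is below. The helper name and dict shape are assumptions, not the actual NPUModelRunner code; the point is to guard the empty case and take the first entry rather than indexing a hardcoded layer-name key.

```python
from typing import Any, Optional


def first_attn_metadata(attn_metadata: dict[str, Any]) -> Optional[Any]:
    # Hypothetical helper: avoid a KeyError (or a crash on an empty
    # dict) by guarding the empty case instead of assuming a fixed key
    # such as a specific layer name is always present.
    if not attn_metadata:
        return None
    return next(iter(attn_metadata.values()))
```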

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the fix_deepseek_mtp_with_shared_expert_dp branch from 9669e03 to b359070 on September 21, 2025 19:34
…ange in NPUModelRunner

Signed-off-by: linfeng-yuan <1102311262@qq.com>
…_cache_bytes disabled

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan changed the title from "[bugfix] fix deepseek mtp with enable_shared_expert_dp" to "[misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes" Sep 22, 2025
@linfeng-yuan linfeng-yuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 22, 2025
@gemini-code-assist bot (Contributor) commented

Hello! You've invoked me with /gemini. If you intended to use a specific command, please specify it after /gemini. For a list of available commands and how to use them, please refer to the help message by typing /gemini help or checking the "Using Gemini Code Assist" section in the PR description.

@linfeng-yuan linfeng-yuan removed the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 22, 2025
@linfeng-yuan linfeng-yuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 22, 2025
@linfeng-yuan (Collaborator, Author) commented

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several targeted fixes for bugs related to deepseek_mtp, enable_shared_expert_dp, and KV cache memory management. The changes correctly adjust MoE state handling and metadata synchronization for shared_expert_dp, prevent an unnecessary reduction in available KV cache memory when use_cached_kv_cache_bytes is disabled, and improve the robustness of accessing attention metadata. The fixes appear correct and well-aligned with the stated objectives. I have no major concerns with this pull request.
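
A hedged sketch of the KV-cache sizing fix described above; the constant and function names are assumptions, with only the 64 MB default margin and the use_cached_kv_cache_bytes flag taken from the PR description.

```python
KV_CACHE_MARGIN_BYTES = 64 * 1024 * 1024  # assumed 64 MB default margin


def usable_kv_cache_bytes(free_bytes: int,
                          use_cached_kv_cache_bytes: bool) -> int:
    # Subtract the safety margin only when cached kv_cache bytes are in
    # use; skipping it otherwise avoids the unnecessary memory decline.
    if use_cached_kv_cache_bytes:
        return max(0, free_bytes - KV_CACHE_MARGIN_BYTES)
    return free_bytes
```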

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the fix_deepseek_mtp_with_shared_expert_dp branch from 1d27f71 to 6fd21e8 on September 23, 2025 04:07
@linfeng-yuan (Collaborator, Author) commented

@wangxiyuan I pushed a new commit to tackle the last problem with large EP deployments and to fix the UT breakage. Please review this commit and check whether there is any blocking problem for merging this PR.

@wangxiyuan wangxiyuan merged commit d01fd1d into vllm-project:main Sep 23, 2025
19 checks passed